
[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package #37874

Merged
orozery merged 7 commits into vllm-project:main from ronensc:cpu-manager
Mar 24, 2026

Conversation

@ronensc
Contributor

@ronensc ronensc commented Mar 23, 2026

Purpose

Refactor the CPU KV-cache offloading subsystem to reduce duplication, remove
unnecessary abstraction layers, and improve code organization.

Consolidate LRU/ARC managers using a strategy pattern

arc_manager.py and lru_manager.py duplicated ~40 lines of identical skeleton
code: take_events, event emission, ref-count management in
prepare_load/complete_load, backend.allocate_blocks(), and __init__
boilerplate. These are now unified in CPUOffloadingManager, with
policy-specific logic isolated in CachePolicy implementations.

Remove the Backend abstraction

The Backend ABC and CPUBackend class added a layer of indirection with no
polymorphism benefit. The block pool logic is now inlined directly into
CPUOffloadingManager as private methods.

Restructure into a cpu/ package

Per reviewer suggestion, the flat files are split into a proper subdirectory with
one responsibility per file:

vllm/v1/kv_offload/cpu/
    manager.py           # CPUOffloadingManager + block pool
    spec.py              # CPUOffloadingSpec
    policies/
        abstract.py      # BlockStatus + CachePolicy ABC
        lru.py           # LRUCachePolicy
        arc.py           # ARCCachePolicy

Key design decisions

  • CachePolicy abstract base covers both block organization and replacement
    decisions — LRU and ARC differ in both, so they cannot be cleanly split
  • Policy selection uses a _CACHE_POLICIES registry dict — adding a new policy
    requires no changes to CPUOffloadingManager
  • Both evict() implementations are now atomic: candidates are collected
    without mutating state; changes only apply if all n evictions can be satisfied
    (the original ARC code could partially evict before returning None)
  • LRUCachePolicy.touch() fixed: `if self.blocks.get(hash):` → `if hash in self.blocks:`
    (the old check was unreliable for blocks with ref_cnt == 0)
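The registry-based selection can be sketched as follows. The stub classes stand in for the real implementations in policies/lru.py and policies/arc.py, and make_policy is an illustrative helper, not vLLM's actual entry point:

```python
# Hedged sketch of the _CACHE_POLICIES registry; stub classes stand in
# for the real LRUCachePolicy/ARCCachePolicy implementations.


class LRUCachePolicy:
    pass


class ARCCachePolicy:
    pass


# Adding a new policy means adding one entry here; the manager code that
# looks names up does not change.
_CACHE_POLICIES = {"lru": LRUCachePolicy, "arc": ARCCachePolicy}


def make_policy(name: str):
    """Instantiate the policy registered under `name`, or raise ValueError."""
    try:
        cls = _CACHE_POLICIES[name]
    except KeyError:
        raise ValueError(
            f"Unknown cache policy {name!r}; expected one of {sorted(_CACHE_POLICIES)}"
        ) from None
    return cls()
```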

Files deleted: arc_manager.py, lru_manager.py, cpu.py, cpu_manager.py,
backend.py, backends/cpu.py

Test Plan

pytest tests/v1/kv_offload/test_cpu_manager.py -v -s

Test Result

13 passed in 4.88s

cc @orozery @albertoperdomo2


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

ronensc added 3 commits March 23, 2026 10:46
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
@ronensc ronensc requested review from ApostaC and orozery as code owners March 23, 2026 09:28
@mergify mergify bot added the v1 label Mar 23, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an excellent refactoring that consolidates the LRUOffloadingManager and ARCOffloadingManager into a single CPUOffloadingManager using a strategy pattern with pluggable CachePolicy implementations. This significantly reduces code duplication and improves modularity. A key improvement is making the eviction logic atomic, preventing partial evictions on failure. The implementation of the strategy pattern, including the CachePolicy ABC and the policy registry, is clean and extensible. The changes are well-tested. I have one suggestion to improve the robustness of the LRUCachePolicy.

        self.blocks[block_hash] = block

    def remove(self, block_hash: BlockHash) -> None:
        del self.blocks[block_hash]
Contributor


high

The remove method in LRUCachePolicy uses del self.blocks[block_hash], which will raise a KeyError if the block hash is not present. While the current call site in CPUOffloadingManager.complete_store ensures the key exists before calling remove, making this method more robust would be beneficial for future maintenance and to prevent potential crashes if the calling logic changes. The ARCCachePolicy implementation already uses a safer pop method. Consider using self.blocks.pop(block_hash, None) to silently handle cases where the block is not in the cache, which improves defensiveness.

Suggested change:
-        del self.blocks[block_hash]
+        self.blocks.pop(block_hash, None)
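The difference the bot is pointing at is plain dict behavior:

```python
# del raises KeyError for a missing key; dict.pop with a default does not.
blocks = {"h1": "block-data"}

blocks.pop("h2", None)        # absent key: silently a no-op
assert "h1" in blocks

try:
    del blocks["h2"]          # absent key: raises KeyError
except KeyError:
    pass
```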

)
self._manager = CPUOffloadingManager(
    backend=backend,
    cache_policy=cast(Literal["lru", "arc"], self.eviction_policy),
Contributor


IMO, since self.eviction_policy comes from user config (line 59), wouldn't it be cleaner to pass the raw string with # type: ignore[arg-type] and let CPUOffloadingManager.__init__ handle the validation? It already raises ValueError for unknown policies, and cast() silences mypy without validating anything at runtime.
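The trade-off can be sketched like this (illustrative names; the real __init__ takes more arguments than shown):

```python
# Sketch of the reviewer's point: cast() only silences the type checker,
# while validation in __init__ catches bad config strings at runtime.
from typing import Literal

_CACHE_POLICIES = {"lru": object, "arc": object}  # stand-in registry


class CPUOffloadingManager:
    def __init__(self, cache_policy: Literal["lru", "arc"]) -> None:
        if cache_policy not in _CACHE_POLICIES:  # runtime check, unlike cast()
            raise ValueError(f"Unknown cache policy: {cache_policy!r}")
        self.cache_policy = cache_policy


eviction_policy = "lru"  # raw string as it would arrive from user config
mgr = CPUOffloadingManager(eviction_policy)  # type: ignore[arg-type]
```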

Contributor Author


Thanks, good point. I'll replace cast() with # type: ignore[arg-type] and rely on the runtime validation in CPUOffloadingManager.__init__.

Collaborator

@orozery orozery left a comment


I would suggest the following file structure:

vllm/v1/kv_offload/cpu/manager.py
vllm/v1/kv_offload/cpu/spec.py
vllm/v1/kv_offload/cpu/policies/abstract.py
vllm/v1/kv_offload/cpu/policies/lru.py
vllm/v1/kv_offload/cpu/policies/arc.py

@ronensc WDYT?


def __init__(
    self,
    backend: Backend,
Collaborator


We should remove the Backend abstraction and merge the code of CPUBackend into CPUOffloadingManager.

Contributor Author


Thanks, this makes sense. I'll merge CPUBackend into CPUOffloadingManager.

I also like the proposed file structure. A couple of questions:

  1. Should the worker/ directory also be moved under cpu/?
  2. Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?

Collaborator


  1. Should the worker/ directory also be moved under cpu/?

We need to decide if we want to split directories based on worker/scheduler.
Let's think about that later.

  1. Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?

Let's do this here.

Contributor Author


Done.
@orozery ready for a 2nd review round.

Contributor Author


Updated PR title and description

ronensc added 4 commits March 23, 2026 14:00
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
@ronensc ronensc changed the title [KV Offload] Consolidate LRU/ARC managers into CPUOffloadingManager with pluggable CachePolicy [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package Mar 23, 2026
@orozery orozery added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2026
Collaborator

@orozery orozery left a comment


Thanks @ronensc ! LGTM!

@orozery orozery merged commit e3c6c10 into vllm-project:main Mar 24, 2026
46 checks passed
@ronensc ronensc deleted the cpu-manager branch March 24, 2026 05:43
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…ackend abstraction, restructure into `cpu/` package (vllm-project#37874)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 3, 2026
### What this PR does / why we need it?
Main-to-main upgrade to vLLM (0324 snapshot).
Fixes the following breaks:

1. PR [#37487](vllm-project/vllm#37487) [V0 Deprecation] Refactor kv cache
from list to element (c59a132f9) — self.kv_cache changed from list[tensor]
(one per virtual engine) to a single tensor

2. PR [#37874](vllm-project/vllm#37874) [KV Offload] Refactor CPU
offloading: pluggable CachePolicy, remove Backend abstraction, restructure
into cpu/ package (e3c6c10ca) — LRUOffloadingManager + CPUBackend have been
refactored into CPUOffloadingManager

3. PR [#32951](vllm-project/vllm#32951) [Async][Spec Decoding] Zero-bubble
async scheduling + spec decoding (fafe76b4a) — a) changes self.positions and
self.seq_lens from CpuGpuBuffer to plain GPU tensors; b) changes the
_get_cumsum_and_arange output parameter; additionally, _prepare_input_ids
gains a num_reqs argument.

5. PR [#35007](https://github.com/vllm-project/vllm/pull/35007) [Bugfix]
Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var
warning (dc6908ac6) — deletes vllm_is_batch_invariant() and the constant
VLLM_BATCH_INVARIANT, replacing them with vllm.envs

Known issues:
1. 310p Qwen3.5 test failed due to a qwen3.5 patch failure; see issue #7976.
@YangShuai52 is fixing.

### Does this PR introduce _any_ user-facing change?
1. Zero-bubble async scheduling + spec decoding needs the NPU
_compute_slot_mapping_kernel and delayed validation of accepted draft
tokens (see PR #7640), so this PR disables the async scheduler when spec
decoding is enabled. This way, the main-to-main upgrade can proceed in
parallel with the Spec Decode + Async Scheduler work until the next
release version.

Co-authored-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: wangbj127 <wangbj1207@126.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>

- vLLM main:
vllm-project/vllm@35141a7
---------
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: wangbj127 <wangbj1207@126.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: wangbj127 <wangbj1207@126.com>